Analysis of 2019 Crime statistics in the USA

Sonia Agarwal CS544(Spring1 O1) 02/24/2022

Dataset Overview

The following data set is looking at 2019 crime statistics in the USA. This data set is retrieved from KD nuggets. . Since 1930, participating local, county, state, tribal, and federal law enforcement agencies have voluntarily provided the nation with a reliable set of crime statistics through the Uniform Crime Reporting (UCR) Program. The FBI, which administers the program, periodically releases the crime statistics to the public.The data includes total number of arrests for all classes state wise, violent crime, property crime, rape, robbery, burglary, motor vehicle theft, fraud, weapon carrying possessing,drunkenness. It also include the 2019 estimated population state wise. There was significant analysis looking at the break down of age if the suspect was under age 18.

Goal of Analysis

The goal of the analysis is to get a better understanding of the type of crimes that happened in year 2019. Which state in US has more arrest for respective crime and separate states. What type of crime is happening more to make good strategy plans to reduce the crime rate. Some questions to ask, are there crimes rates identified by age below 18. What does the data show about types of crime rates? Where peoples need to take more precautions to save themselves being a victim.? Where police need to pay more attention to control these crime rates.?IS there any correlation between drug abuse violation and violent crime? Is there any correlation between property crime and violent crimes? Which crime is the Highest reported for respective state with percentage? which state has more reported arrests for age under 18 and age 18 and over?

Data Prep

The data provided held a lot of valuable information. As I mentioned earlier the data was downloaded from KD nuggets in excel format and data had two classification under 18 and all ages.I transformed the data in two different classifications named as “classification” “under 18” and “18 and above”. Also for second classification name had NA so I replaced name with the corresponding state name and saved in “csv” format. I saved file as “Arrests_by_State_2019.csv in same working directory. Then I created the project and imported the data using read.csv function.Often new data frames or vectors were created to store this information. This was so that multiple parts of the analysis could use the newly added data frame. The main data set was stored into a data frame called 2019 crime statistics in USA, which is used throughout the analysis. library(plotly),library(dplyr),library(UsingR),library(sampling).

Number of Arrests(all classes)

The first variable analyzed was the number of Arrests for all crimes for two age groups, “under 18” and “18 and above”, state wise. I created the data frame using the main data and used three columns, State, Classification and Total_all_classes. Then I used plotly to plot the bar plot.

Plot shows the number of arrests in decreasing order state wise and two bars of under 18 and 18 and above for same state. The graphs is skewed to the right which means the distribution is more distanced from the mean value. Which means the crime rate is decreasing for the states to the right side of the graph. California, Florida and Texas are the top 3 states having most number of arrest cases reported for the age group 18 and above.

x<-list(categoryorder = "total descending")
y<-list(title="Total Arrests(all classes)")

df1 <- data[c("State", "Classification", "Total_All_Classes")]
p1 <- df1 %>%
  plot_ly(x = ~State, 
              y=~Total_All_Classes, 
          color=~Classification,
          #name = '2019 Total Arrests(All classes)', 
          type = 'bar',
          text = ~Classification, 
          texttemplate = '%{y:.2s}', 
          textposition = 'outside',
          width = 1000
          ) %>%
  layout(xaxis = x, yaxis= y, barmode = "group")
p1

Distribution of Violent crime

Violent Crime is composed of four offenses: murder and non negligent manslaughter, rape, robbery, and aggravated assault. Violent crimes are defined in the UCR Program as those offenses that involve force or threat of force.The Violent Crime section of this report provides more information about violent crime and an overview of violent crime data for 2019.

Box plot distribution under Violent Crime

This distribution compairs the Violent crime with other four crime under its catrgoty.

# ############box plot
plot_ly(data, y = ~Violent_Crime
 , type="box", name = 'Violent_Crime') %>%
  add_trace(y = ~Robbery , name = 'Robbery ') %>%
  add_trace(y =  ~Rape, name = 'Rape') %>%
  add_trace(y =~Aggravated_Aassault , name = 'Aggravated_Aassault ') %>%
  add_trace(y =~Murder_and_nonnegligent_manslaughter
 , name = 'Murder and nnmans') %>%
  layout(xaxis = list(title = 'Arrests',xlim=c(0,40000))) -> p

p

Distribution of Total Arrests(all classes) State wise For Age below 18

Plot shows the number of arrests in decreasing order state wise for age under 18 . The graphs is skewed to the right which means the distribution is more distanced from the mean value. Graph has exponential distribution Which means the crime rate is decreasing for the states to the right side of the graph for age under 18 Texas,Florida and California, are the top 3 states having most number of arrest cases reported for the age group under 18.

x<-list(categoryorder = "total descending")
y<-list(title="Total Arrests(all classes)")
df_Under18<-data[seq(1, 101, by = 2),]
df_18andabove<-data[seq(2, 102, by = 2),]
df_Under18 <- df_Under18[c("State", "Classification", "Total_All_Classes")]
p2 <- df_Under18 %>%
  plot_ly(x = ~State, 
              y=~Total_All_Classes, 
          color=~Classification,
          name = '2019 Total Arrests(All classes) under 18', 
          type = 'bar',
          text = ~Classification, 
          texttemplate = '%{y:.2s}', 
          textposition = 'outside',
          width = 1000
          ) %>%
  layout(xaxis = x, yaxis= y, barmode = "bar")
p2

Central Limit Theorem For Total Arrests(all classes) for Age under18

A large standard deviation indicates that the data points are far from the population mean.The higher the standard deviation, Less will be Arrest rate. I used the population as total arrests for under age 18 Set the start seed for random numbers as 1234.

Population Mean [1] 10403.55 Population SD [1] 10848.28

population<-df_Under18$Total_All_Classes
population_mean<-mean(population);population_mean
## [1] 10403.55
population_sd<-sd(population);population_sd
## [1] 10848.28
pop<-rnorm(population, mean = population_mean, sd = population_sd)

n1=population
hist(pop,col = "brown",
        main ="Histogram of Population",
        xlab = "Arrests",labels=TRUE)
   abline(v = mean(n1), col = "Red")
 abline(h=0)

### Central Limit Theorem For Total Arrests(all classes) for Age under18 The Central Limit Theorem states that the distribution of the sample means for a given sample size of the population has the shape of the normal distribution. The theorem is shown with various distributions of the input data in the following sections.” In other words, as the sample size gets larger, the means of the samples become a normal distribution. This was tested against the age distributions of arrests for age under18. Below is a figure showing the distributions of 5000 random samples of sample sizes of 10, 20, 30, and 40 with replacement is true and showed that the applicability of the Central Limit Theorem for this variable.With samples sizes of 10, 20, 30, and 40 histogram of sample means is verified.

df_Under18<-data[seq(1, 101, by = 2),]
df_18andabove<-data[seq(2, 102, by = 2),]
df_Under18 <- df_Under18[c("State", "Classification", "Total_All_Classes")]
population<-df_Under18$Total_All_Classes
population_mean<-mean(population);population_mean
## [1] 10403.55
population_sd<-sd(population);population_sd
## [1] 10848.28
par(mfrow=c(1,1))

sampleSizes <- c(10,20,30,40)

set.seed(1234)

col =c("lightgreen","cyan","red","yellow")

n=530581

for (size in sampleSizes) {
  set.seed(1234)
  s <- c()

for (i in 1:500) {
 
  s[i] <- mean(sample(population,size, replace = T))
}


hist(s, col =col[which(size == sampleSizes)], main=paste("Histogram of Sample Means of Arrests (size = ",size,")")
,xlab = "Arrests",ylim = c(0,600),labels = TRUE)
abline(v = mean(s), col = "Red")
abline(v = mean(population), col = "blue")

}

 par(mfrow=c(1,1))

Sampling

Sampling is used to identify and analyze any trends or patterns that can be seen in a subset of a larger group of data. It can also be useful technique to help predict some type of data or information. There are many different types of sampling that can be applied to data. The sampling methods used for this analysis are simple random sampling without replacement, systematic, and stratified. The sampling was specifically looking at number of arrests under 18(Total all classes). Simple random sampling is when a specified sample is selected from the larger group or larger frame. Each person or object has an equivalent opportunity of getting selected. An example is if we have 4 girls,3boys and 6 old persons and simple random sampling were to be used to pick a sample size of 5 any five the group of 13 could be picked. The same is being done in the analysis of Total number of arrests under 18(Total all classes). The sample size of 50 is being used. Out of the population of 10000, there will be 50 randomly selected without replacement. Another technique that was used to sample the data was stratified sampling. Stratified sampling is when the larger group of data is broken into smaller groups and then certain sizes are picked from each group. In this analysis, the races are broken into each group and random “individuals” are selected from each group. The final technique of sampling used for the analysis is systematic sampling. Systematic sampling is when there are “rules” decided to pick the sample size. For example, if there were 10 houses on 4 streets. The “rule” could be that the sampling will start with the 2nd house and then every 3rd house will be chosen. There is a potential bias with this type of sampling. For each one of these sampling techniques, a sample size of 50 was used. The first bar plot in the top left corner shows the break down from the entire population i.e., no sampling. The bar plot at the top right corner is showing simple random sampling without replacement. The bar plot on the bottom left is showing systematic sampling, and the one on the bottom right is showing stratified.

df_Under18<-data[seq(1, 101, by = 2),]
df_18andabove<-data[seq(2, 102, by = 2),]
df_Under18 <- df_Under18[c("State", "Classification", "Total_All_Classes")]
population<-df_Under18$Total_All_Classes
population_mean<-mean(population);population_mean
## [1] 10403.55
population_sd<-sd(population);population_sd
## [1] 10848.28
par(mfrow=c(1,1))


set.seed(1234)

#sample function s() with replace FALSE for drawing samples


s <- sample(population, 1000, replace = TRUE)

n1=10000; sample.size=20

for (i in 1:n1) {
 
  s[i] <- mean(sample(s,sample.size, replace = FALSE))
}


hist(s, col ="cyan", main=paste("Histogram of Sample Means of Arrests under 18 (size = ",sample.size,")")
,xlab = "Arrests",ylim = c(0,5000),labels = TRUE)
abline(v = mean(s), col = "Red")

#abline(v = mean(n1), col = "blue")



 par(mfrow=c(1,1))
 
 ################# # srswor
 
set.seed(1234)

n=50
#N=nrow(df_Under18$Total_All_Classes)

s <- srswor(50, 51)
sample.1 <- df_Under18[s != 0, ]

t3<-table(sample.1)

hist(t3, col ="yellow", main=paste("Histogram of Sample Means of Arrests under 18 ")
,xlab = "Arrests",ylim = c(0,5),labels = TRUE)

#abline(v = mean(s), col = "Red")

 
# systematic sampling
set.seed(1234)

N2 <- nrow(df_Under18$Total_All_Classes)
n2 <- 20

# items in each group
k <- ceiling(51 / 20)

# random item from first group
r <- sample(k, 1)

# select every kth item

seq(r, by = k, length = n)
##  [1]   2   5   8  11  14  17  20  23  26  29  32  35  38  41  44  47  50  53  56
## [20]  59  62  65  68  71  74  77  80  83  86  89  92  95  98 101 104 107 110 113
## [39] 116 119 122 125 128 131 134 137 140 143 146 149
sample.2 <- df_Under18[s, ]

t4 <- table(sample.2)

hist(t4, col ="red", main=paste("Histogram of Systematic Sample Means of Arrests under 18 ")
,xlab = "Arrests",ylim = c(0,5),labels = TRUE)

#abline(v = mean(s), col = "Red")

Violent Crime Vs Property Crime based on age classification State wise

Based on the graphs its look like there is relationship between property crime and Violent crime.

slope <- 2e-05

data$size <- sqrt(data$Total_All_Classes * slope)


data%>%
  plot_ly(
    x = ~Violent_Crime, 
    y = ~Property_Crime, 
    size = ~size, 
    sizes = c(min(data$size), max(data$size)),
    color=~Classification,
    colors="Set1",
    frame = ~State, 
    text = ~crime, 
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  ) %>%
  layout(
    xaxis = list(
      type = "log"
      ))
#two  dimensional scatter plot

fig<- plot_ly(data, x = ~Violent_Crime, y = ~Property_Crime, type = 'scatter', mode = 'markers',
        text = ~paste('State ', State))

fig

3D Scatter Plot Violent Crime Vs Drug Abuse Violations

There is a positive correlation between violent crime and property crime but the data is more distributed as the values increases.Crime rate is increasing with the chances that person is on drugs.

plot_ly(data, x = ~Violent_Crime, y = ~Drug_abuse_violations
, z = ~Classification) %>%
  add_markers(color = ~State)

Pie Chart

This pie chart shows the percentages of the crimes in all sates of United States for year 2019.

library(dplyr)


transformed_pie_data<-t(data)
transformed_pie_data <- as.data.frame(t(data)[3:33,])
transformed_pie_df <- mutate_all(transformed_pie_data, function(x) as.numeric(as.character(x)))

transformed_pie_df <-transformed_pie_df %>%
  mutate(Total = rowSums(.))
crimes <- colnames( data[4:33])

transformed_pie_df <- transformed_pie_df %>%
  mutate(Perct = round( Total/transformed_pie_df$Total[1]*100, 2))
piedata <- data.frame ( Crime = crimes, Percent = c(transformed_pie_df$Perct[2:31]) )

fig1 <- plot_ly(piedata, labels = ~Crime, values = ~Percent, type = 'pie')
fig1 <- fig1 %>% layout(title = '2019 Total Arrests(All ages)in USA',
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

fig1

Conclusion

This analysis provides arrest data reported for each state for 2019. The table provides both total arrests and arrests of juveniles (persons under the age of 18).These data represent the number of persons arrested; however, some persons may be arrested more than once during a year. Therefore, the statistics in this table could, in some cases, represent multiple arrests of the same person. California state has the highest reported arrests cases and Florida is on second position.Population has really an impact on number of crime rates. Taxes has highest number of crime arrests under 18.Time to time consultant program help can be helpful to control the increasing crime arrests in Taxes for under 18 age group.

library(wordcloud)


transformed_wc_data <- as.data.frame(t(data)[3:33,])
transformed_wc_df <- mutate_all(transformed_wc_data, function(x) as.numeric(as.character(x)))

transformed_wc_df <-transformed_wc_df %>%
  mutate(Total = rowSums(.))

transformed_wc_df <- transformed_wc_df %>%
  mutate(Rank = rank(Total))


wordcloud(rownames(transformed_wc_df), freq = transformed_wc_df$Rank, random.order = FALSE,
          random.color = FALSE,
          rot.per = 0.15,
          scale = c(1,1),
          min.freq = 1,
          colors=brewer.pal(8, "Dark2"),
          vfont=c("sans serif","plain"))